1 Introduction

This is a tutorial for learning basic principles of programming with R, and for making some figures that show temporal patterns.

You will only need R running on your computer, and a csv provided by the professor. The CSV is from a NOAA sensor collecting temperature data in Chapel Hill. I am also using RStudio to facilitate the programming with R.

You will learn:

  1. How to import data in R.
  2. R data types.
  3. How to change text values to date format
  4. How to edit values and use color palletes
  5. How to use plotly to make interactive plots.

Link to Video

the_file <- "/Users/javierarce/Downloads/TMAX_chapel_hill.csv" ## identify where the file is located in your computer and change the value accordingly
## On a mac: find the file in finder and then right-click the file, and then press option. This should show you that you can copy the file as pathname. Then paste the value in place ofu see my original file path.
## On a PC: find the file using File Explorer. Press the Shift key and then right-click on the file. You should be able to see the Copy Path option. Then paste the value in the code in place of my original file path. BE SURE TO CHECK the direction of the slashes. Notice that the slash is facing forward in my example, and that is the way that R understands it. PCs often put the slashes backward. 

df <- read.csv(the_file) # open a csv
## when R reads a CSV it usually creates a data.frame which is the name that R gives to tables

## I originally got this data from ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/
## you can also get the data for a particular station and time at https://www.ncdc.noaa.gov/cdo-web/

head(df) ## we use head() to see the column names and the first 6 rows of the data.frame we opened. We could also use tail() to see the last six rows
##      X.3 US1MISW0005 X20180101 PRCP  X0  X X.1 N X.2
## 1  21062 USC00311677  20180101 TMAX -33 NA  NA 7 800
## 2 116823 USC00311677  20180102 TMAX -28 NA  NA 7 800
## 3 213478 USC00311677  20180103 TMAX  -6 NA  NA 7 800
## 4 310682 USC00311677  20180104 TMAX   0 NA  NA 7 800
## 5 505407 USC00311677  20180106 TMAX -11 NA  NA 7 800
## 6 600882 USC00311677  20180107 TMAX -33 NA  NA 7 800

Notice that the column names are not very nice. Let’s change that. But before we do that let’s learn some basic concepts in R.

We need to learn the concept of vectors. You could think of vectors as “a list of values with the same data type.” Even though R also has the concept of lists, in class we will be mostly dealing with vector and data.frame containers when using R. I am going to create a few vectors here.

Link to Video

myLetters = c("G","E","O") ## a vector with 3 characters
myNumbers = c(4,5,6) # a vector with 3 numbers 

You should notice the function c that is used to create the vectors (c means create). You can check the data type and class of each of these vectors by using the method class.

class(myLetters)
## [1] "character"
class(myNumbers)
## [1] "numeric"

Some classes can be changed to other classes using methods such as as.integer, as.numeric, as.list, as.data.frame …..

# it is easy to change a numeric class to a character 
as.character(myNumbers)
## [1] "4" "5" "6"
# but it is not straightforward to change the character to a number 
as.numeric(myLetters)
## Warning: NAs introduced by coercion
## [1] NA NA NA
# you can change vectors to data.frames
as.data.frame(myLetters)
##   myLetters
## 1         G
## 2         E
## 3         O
## you should see that the vector is now a column. 
## in fact, I think it is a useful thing to think about vectors as columns in a table. 
## Let's make a data.frame based on the two vector that we have created. 
## the vector myLetters will have the column name col1 and the numbers123 vector will have the name col2
test = data.frame(col1 = myLetters, col2 = myNumbers )
test
##   col1 col2
## 1    G    4
## 2    E    5
## 3    O    6

Link to Video

In any language there will be ways to assign a number to each element within a vector, table, data.frame, array, list, raster, object, etc. In R, Python and JS this is called indexing. Hence indexing is an important concept for you to remember.

In R, vectors have only one dimension. If we want to select the first element in the vector myLetters, we would write myLetters[1]. The fact that the first element has index [1] is one of the major differences with other programming languages. As we will see, for Python and for JS the first element will be [0].

In the case of data frames, we have 2 dimensions, the rows and the columns. To get an element in a data frame we write the number of the row, then a comma and then the column. For example, to get the second row and first column of our test data frame we write test[2,1]. We can also select entire rows or columns of a data.frame.

myLetters[1]## selecting one element in a vector
## [1] "G"
test[2,1] ## selecting the cell in the second row and first column
## [1] E
## Levels: E G O
test[2,] ## selecting the second row of the data.frame test. 
##   col1 col2
## 2    E    5
test[,2] ## selecting the second column of the data frame test, you could also use test[2] but I prefer to use the comma to remind me that I am dealing with columns 
## [1] 4 5 6
test$col2 ## selecting the first column by using the name of the column after a dollar sign ($) 
## [1] 4 5 6

Link to Video

Above, we saw that our data.frame had some weird column names and some columns with useless data. Let’s start getting rid of those columns. Let’s look at the first six rows of the data frame:

head(df)
##      X.3 US1MISW0005 X20180101 PRCP  X0  X X.1 N X.2
## 1  21062 USC00311677  20180101 TMAX -33 NA  NA 7 800
## 2 116823 USC00311677  20180102 TMAX -28 NA  NA 7 800
## 3 213478 USC00311677  20180103 TMAX  -6 NA  NA 7 800
## 4 310682 USC00311677  20180104 TMAX   0 NA  NA 7 800
## 5 505407 USC00311677  20180106 TMAX -11 NA  NA 7 800
## 6 600882 USC00311677  20180107 TMAX -33 NA  NA 7 800

I would like to keep the third column that has the date (X20180101) and the fifth column that has the temperature (X0). I could either delete the columns that I do not want, or make a new data frame with the columns I am interested in. I will show you both ways.

## making a new df with the columns that I want.
df1 = data.frame(date = df$X20180101, TMAX = df$X0)
## another alternative
df2 = df[,c(3,5)] # selecting column 3 and 5 and saving it as df2
names(df2) = c("date","TMAX") ## naming the columns

df3 = df ## making a copy just in case I get it wrong 
df3[,c(1,2,4,6:9)] = NULL ## deleting columns 1,2,4 6 to 9. Hence keeping 3 and 5
names(df3) = c("date","TMAX") ## naming the columns

Link to Video

df3 <- df3[order(df3$date),] ## I am making sure that the data.frame is sorted by date
##I am using a function named order. This function sorts the data, however I want to order all the rows in df3 based on the order of the column date.  

Link to Video

Now let’s make a simple bar graph to see how the maximum temperature changed. For that we will install two libraries (plotly and colorbrewer).

#install.packages("plotly") ## take out the hash-symbol at the beginning of this line if you want to run this line in R. Hashes are used for making comments. 
#install.packages("RColorBrewer")

## Your software may prompt you to select a CRAN Mirror location. CRAN is the ftp repository where we can obtain R libraries and R software versions. 

## Libraries are files that contain functions for different purposes. The libraries also contain help files and other libraries. In the case of Python, the "equivalent" of CRAN is pip or conda, and for JS the "equivalent" would be npm. For Python and R you usually install the libraries in your computer and call them at the begining of your code by typing in R library(theNameOfTheLibrary) and in Python "import  theNameOfTheLibrary". When calling a JS script inside an HTML file we will call the library by using a script tag and putting the source address such as <script src="/folder/theLibrary.js"></script> or calling the script from a https address or a CDN (Content Delivery Network) such as  <script src="https://unpkg.com/leaflet@1.7.1/dist/leaflet.js""></script>

2 Plots

Link to Video

#After installing the packages you need to open the libraries.

library(plotly)
## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(RColorBrewer)

## now let's make our bar plot using Plotly. 
## If you go to google and type Plotly Bar Chart R you will get this page https://plotly.com/r/bar-charts/
## Here you have a simple example of how to make a bar chart. 


fig <- plot_ly(x = df3$date, y = df3$TMAX, name = "TMAX Chapel Hill 2018",type = "bar")
fig
20.1802M20.1804M20.1806M20.1808M20.181M20.1812M−50050100150200250300350
### During the past two years I used the ggplot2 library.  This year I have decided that the plotly library is better for learning since it can be used in both python and JS. Also, it allows for easy interactivity which is an important part of data analysis when dealing with time and space.  If you are interested in making the figures with ggplot, here is the code:
#p <- ggplot(df, aes(x=date, y=TMAX)) + geom_bar(stat="identity") ## this asks ggplot to make a plot using data from df with the values on the x axis from date and the values on the y axis from TMAX. 
# the + geom_bar(stat = "identity") adds the instruction to make it a bar graph where the heights of the bars represent the y-values
#p ## will print the plot

Congratulations, you have made your first bar plot! If you do not see a plot like the one above, check that you have the plotly libraries installed and that you have opened the csv as a data.frame.

Notice two things.

  1. the vertical axis goes from 0 to almost 400. This is because the temperature values are recorded as decimal centigrade. So you should divide the values by 10 if you want a more normal representation of the temperature values. But for now we will use the numbers as they come from NOAA.
  2. Notice that the bar plots are clustered. This is because we gave R the value dates, hence Jan 31 2018 (20180131) will appear as if it is 70 units away from February 01 2018 (20180201). So, something we should have done from the beginning was check the data types, and change the values in the date column to date type.

Link to Video

df3$day <- as.Date(as.character(df3$date), "%Y%m%d") ## this will create a new column named day by interpreting the values in column date as Year month day. 
## now let's make the plot again with the value day in the x axis
fig <- plot_ly(df3, x = ~day, y = ~TMAX, name = "TMAX Chapel Hill 2018",type = "bar") ## notice also that this time I gave  df3 as the first argument, then assigned the x value by putting  ~day instead of having to write df3$day
fig
Jan 2018Mar 2018May 2018Jul 2018Sep 2018Nov 2018−50050100150200250300350
dayTMAX

Notice that there are some gaps in the data. You could have also noticed that some data was missing by looking at the number of rows in the data and noticing that there are only 322 counts and not 365 counts.

nrow(df3)
## [1] 322
tt = df3[!is.na(df3$TMAX),]
nrow(tt)
## [1] 322

Today I am going to overlook the missing 43 counts as they seem to be missing throughout many months. Missing data is going to be a common problem, and there are different ways to address it, but that is not the purpose of this tutorial.

Link to Video

Let’s get back to making plots and giving these plots some color. Most plotting methods have ways to assign color values. However plotly is not very good at assigning colors, and also I would like you to understand the basics of assigning colors so you can apply it to different software, graphs and maps.

In R I find the library classInt very useful to classify data and assign color. I am going to start with a simple example and then go back to our temperature example.

myLetters = letters[1:5] # making a vector of letters that go from a to e
myNumbers = c( 1,1,1,5 ,3) ## making a vector with numbers 
myNumbers = c(rep(1,3),5,3) ## the same as before but just showing you how to use rep to repeat numbers
myColorPalette = brewer.pal(9,"RdYlBu") ## I selected the 9 colors from the spectral combination from color brewer. You can check the other possible color combinations by checking display.brewer.all()

myColors = colorRampPalette(myColorPalette) ## I am creating a function that will interpolate colors within the palette that I give it. 

myColors(5) ## you should see that myColors is a vector containing the name of colors in hexcode. ## The number is the number of colors that it will generate with palette that you selected. 
## [1] "#D73027" "#FDAE61" "#FFFFBF" "#ABD9E9" "#4575B4"
## Change the number from 5 to other numbers to see how many colors it creates. 
## You can put whatever value you wish, but if the number is too high you will see some colors repeated, since there is a limit on how many different colors can be generated. 
## Let's use five colors since we have 5 bars. 
plot_ly(x = myLetters, y = myNumbers, marker = list(color=myColors(5)), name = "5 Bar test",type = "bar")

***

You should see a bar plot with each letter having a bar with a different color in the spectral range. Let’s say we want to assign a color to a value, so for example a,b and c have the same color and that d,e have another color. We can do this in many ways, and most plotting functions already have ways to do the color interpolation. However, since I plan to export this data.frame to JS to make the figure on the web, I would rather generate a column that contains the color that will be assigned to each value. (Also, certain plotting software (including plotly) has some weird issues with how to assign colors to markers.)

Let’s say you want to use color to represent a value, hence in our previous plot, all the letters with the value 1 should be red (a,b,c). The bar that has the value 3 should have a yellow color (bar-e) and the bar that has a value of 5 (bar-d) should have a blue color.

We could take advantage of indexing in R to do this.

mC = myColors(5) ## creating a 5 color vector
mC[1] ## this should show us the first color in our vector (red) #D73027"
## [1] "#D73027"
mC[3] ## this should show us the third color in our vector(light yellow) #FFFFBF
## [1] "#FFFFBF"
mC[5] ## this should show us the fifth color in our vector (blue) "#4575B4"
## [1] "#4575B4"
mC[myNumbers] ## if you use myNumbers as the index you should see that the colors that you get are assigned to the values. 
## [1] "#D73027" "#D73027" "#D73027" "#4575B4" "#FFFFBF"
## Now let's apply what we have learned to the temperature plot. Notice that I used the same code as before but added myNumbers as an index of myColors(5)
plot_ly(x = myLetters, y = myNumbers, marker = list(color=myColors(5)[myNumbers]), name = "5 Bar test",type = "bar")

Link to Video

In the previous example, assingning the value to the index of the color was relatively straightforward since the values were positive and integers. However in our example of temperatures we have negative values. So we need to make all the values positive by adding the minimum value. Also let’s assign a color to each possible integer from the min value to the maximum.

myScale = min(df3$TMAX)-1 ## min value of TMAX, minus 1 since in R the index starts at 1.
print(myScale) ## this will be the value that will be added sothat all values are between 1 and max + myScale
## [1] -34
print(max(df3$TMAX))## max value of TMAX
## [1] 361
daRange = max(df3$TMAX)-(min(df3$TMAX)-1)
print(daRange) ## range value of TMAX. I will use this to define how many colors to use 
## [1] 395
df3$daColor = myColors(daRange)[df3$TMAX-myScale] ## I am creating a column with the assigned color using the indexing method
##Using the previous temperature graph that we used before, let's do the following:
plot_ly(df3, x = ~day, y = ~TMAX, marker = list(color=df3$daColor), name = "TMAX Chapel Hill 2018",type = "bar") 
Jan 2018Mar 2018May 2018Jul 2018Sep 2018Nov 2018−50050100150200250300350
dayTMAX

Oh no! the red colors are representing cold temperatures, and the blues represent hot. We need to do something about this.

Link to Video

R, Python and JS have methods to reverse the order of the vector. In R the method is rev(), in Python [::-1] , and in JS the method is .reverse().

I am going to first reverse the colorPalette

myColorPalette = rev(brewer.pal(9,"RdYlBu")) 
myColors = colorRampPalette(myColorPalette)
df3$daColor = myColors(daRange)[df3$TMAX-myScale]

plot_ly(df3, x = ~day, y = ~TMAX, marker = list(color=df3$daColor), name = "TMAX Chapel Hill 2018",type = "bar") 
Jan 2018Mar 2018May 2018Jul 2018Sep 2018Nov 2018−50050100150200250300350
dayTMAX

Now this looks much better.

Link to Video

The previous figure reminded me of warming stripes by Ed Hawkins

In warming stripes, Hawkins compares temperature for multiple years at a location, and only uses relative color to show the patterns of temperature change. To mimic this, I will assign the same value of ‘y’ to all the bars, keeping the color coding of the bars based on TMAX.

constY = c(rep(1,length(df3$TMAX))) ## creating a vector of 1s that has the length df3$TMAX. Most JS uses .length, and Python len(). 
plot_ly(df3, x = ~day, y = constY, marker = list(color=df3$daColor), name = "TMAX Chapel Hill 2018",type = "bar")
Jan 2018Mar 2018May 2018Jul 2018Sep 2018Nov 201800.20.40.60.81
day

Link to Video

I don’t like the missing days. So I am going to make days a factor. Factor is an R data type that is used as a category, stored as a level. It is important to remember when working with R data that date types and things that look like numbers or characters could be factors. Not recognizing this that could mess up our analysis. However, in the following case you will see that I will make days behave differently by making them into factors.

fig = plot_ly(x = as.factor(df3$day), y = constY, marker = list(color=df3$daColor), name = "TMAX Chapel Hill 2018",type = "bar") 
fig
2018-01-012018-01-102018-01-192018-01-272018-02-042018-02-122018-02-202018-03-012018-03-092018-03-182018-03-272018-04-052018-04-132018-04-212018-04-292018-05-072018-05-172018-05-252018-06-032018-06-112018-06-202018-07-012018-07-092018-07-182018-07-262018-08-062018-08-142018-08-232018-09-042018-09-122018-09-232018-10-022018-10-102018-10-212018-10-302018-11-092018-11-222018-12-012018-12-102018-12-202018-12-3000.20.40.60.81

Link to Video

I will also eliminate the labels in the x and y axis. For that I will create a variable named noAxisList wich contains all the information that plotly expects as a list.

noAxisList <- list(
  title = "",
  zeroline = FALSE,
  showline = FALSE,
  showticklabels = FALSE,
  showgrid = FALSE
)
fig = fig %>% layout(title = 'Color Bar',   ### here I am using the pipe operator (%>%) to change the figure which we already made
         xaxis = noAxisList,
         yaxis = noAxisList)
fig

Pretty cool!

Link to Video

But let’s make this graph even more fun. While we can see that the x axis is time, it does not gives us a sense of a cycle. We could change the image to invoke the metaphor of time from a clock. (I was also inspired by Andrienko et. al.’s figure on cyclical processes). To give it a cyclical look I will use the polar graph method. Let’s start with plotly’s basic example. In this page you can find details about all the things that you can change in the plot. https://plotly.com/r/reference/barpolar/

fig <- plot_ly(
    type = 'barpolar',
  ) 
fig <- fig %>%
  add_trace(
    r = c(3, 1.5),
    theta = c(45, 0),
    width = c(10,10),
      marker = list(
        color = c("#00FF00","#FF0000")
      )
  )
fig
45°90°135°180°225°270°315°00.511.522.53
trace 1

Notice that a barpolar plot is created by assigning values of radius (distance from the center) and theta (angle). In our case we want to use temperature as radius and the date as the theta, hence we need to convert the day of the month to an angle. We have 360 degrees and 365 days. Hence we can convert the day of the year by dividing the day of the year by 365 and then multiplying that proportion by 360. Let’s do that.

## to get the julian day in R we convert a day to %j 
## This is needed because, for example our 31st row is Feb 2 since we are missing some data. Hence we can not use the row number to determine the date.  
print(df3$day[31])
## [1] "2018-02-02"
## we can use this method
print(format(df3$day[31],"%j")) ## this should give us a value of 33 as character 
## [1] "033"
print(as.numeric(format(df3$day[31],"%j")))  ## this should give us a value of 33 as numeric
## [1] 33
df3$theta = (as.numeric(format(df3$day, "%j"))/365)*360 ## I know we have leap years, but I will not take care of that now!

Now our data frame has all the information to create the bar polar graph.

fig <- plot_ly(
    type = 'barpolar',
  ) 
fig <- fig %>%
  add_trace(
    r = df3$TMAX/10, ## divided by 10 so the values are in actual degrees Celsius
    theta = df3$theta,
    width = constY,
      marker = list(
        color = df3$daColor
      )
  )
fig
45°90°135°180°225°270°315°−505101520253035
trace 1
fig %>% layout(
  polar = list(
    angularaxis = list(
      tickmode  = 'array',
      ticktext = c("Winter","Spring","Summer","Fall"), 
      tickvals = c(0,90,180,270),
      direction = "clockwise", ## to have the days increase clockwise
      rotation = -90 # to have the winter on the bottom
               )
  )
)

What patterns can you see now that were not as obvious when seeing the data horizontally?
What changes could be made to this visualization to make it more elegant and informative?

Link to Video

Now let’s do another version of a plot which has become a very popular way to visualize change in time. The ridgeline plot ( also known as the joyplot… but that is another story) was originally made by a radioastronomer looking at pulsars. The plot shows brightness as a function of time for a periodic process. If you look for #joyplots in twitter you will see many examples.

One of the main beauties of R is that people create libraries with very easy to read documentation. To make a ridgline plot first we download a library and prepare the data.

## I have not found a decent way of doing this with plotly. I can make the distributions in plotly using violin plots but it would take a lot of hacking to get it to assign the colors by temperature. So we will keep using ggridges
#install.packages("ggridges")

library(ggridges) # open the ggridges library. It needs ggplot2 and a recent version of R to work

df3$month <- format(df3$day,"%m") ### from the column day create another column named month that has the number of the month

head(df3)
##       date TMAX        day daColor     theta month
## 1 20180101  -33 2018-01-01 #4575B4 0.9863014    01
## 2 20180102  -28 2018-01-02 #497AB6 1.9726027    01
## 3 20180103   -6 2018-01-03 #5E93C3 2.9589041    01
## 4 20180104    0 2018-01-04 #649AC7 3.9452055    01
## 5 20180106  -11 2018-01-06 #598EC0 5.9178082    01
## 6 20180107  -33 2018-01-07 #4575B4 6.9041096    01

I made the column month because I am going to aggregate the data by month.

ggplot(df3, aes(x = TMAX, y = month, fill = ..x..)) +
  geom_density_ridges_gradient(scale = 3, rel_min_height = 0.01) +
  labs(title = 'Temperatures in CH') +
  scale_fill_gradientn(colours = colorRampPalette(rev(brewer.pal(9,"RdYlBu")))(daRange)) ## Notice that ggplot already has a method to assign colors
## Picking joint bandwidth of 21.4

What do you see from this graph that you couldn’t see using the bar graph or the circular graph? What changes could be made to this visualization to make it more elegant and informative?

You can download the R script and date from Sakai: Resources/Data/Chapel Hill

3 Summary

In this tutorial you have learned several R concepts, such as

You have also learned new methods in R.

(See if you can recall what each method does.)